Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough

The pull request adds support for Qwen3.5 MoE quantization across the export pipeline, improves error reporting during quantized weight export with module-aware diagnostics, introduces a patching mechanism to handle transformer-related weight conversion issues, enhances tokenizer encoding with truncation parameters, and relocates custom model file copying in PTQ workflows.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✅ Passed checks (2 passed)
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #897   +/-   ##
=======================================
  Coverage   72.03%   72.03%
=======================================
  Files         207      207
  Lines       22718    22718
=======================================
  Hits        16365    16365
  Misses       6353     6353
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/layer_utils.py (2)
1085-1102: ⚠️ Potential issue | 🟠 Major

Zero amax values are still collected into `valid_amax_values`, causing them to silently perpetuate when all experts are uncalibrated.

The new `needs_amax` condition (line 1142) correctly identifies all-zero amax tensors as invalid. However, the `valid_amax_values` collection loop (lines 1089–1098) only checks `existing_amax is not None`, so zero tensors are collected. When every expert has `amax == 0`:

- `valid_amax_values = [0, 0, ...]`
- `target_amax = torch.max(stack([0, ...])) = 0`
- the `elif target_amax is None` branch (line 1105) is skipped, so the weight-stat fallback never runs
- `needs_amax = True`, so `quantizer.amax` is set to `0` again
- the warning misleadingly says "Setting it to 0.000000 (max from existing quantizers in current batch)"

The fix is to mirror the `needs_amax` predicate in the collection loop:

🐛 Proposed fix
```diff
 valid_amax_values = []
 for _, attr_name, quantizer in all_quantizers:
     existing_amax = getattr(quantizer, "amax", None)
-    if existing_amax is not None:
+    if existing_amax is not None and not (
+        isinstance(existing_amax, torch.Tensor) and torch.all(existing_amax == 0)
+    ):
         # Convert to tensor and add to collection
         if isinstance(existing_amax, torch.Tensor):
             valid_amax_values.append(existing_amax.to(target_device))
         else:
             valid_amax_values.append(
                 torch.tensor(existing_amax, dtype=torch.float32, device=target_device)
             )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/layer_utils.py` around lines 1085 - 1102, When collecting existing amax values into valid_amax_values (loop over all_quantizers / existing_amax), only append values that are non-zero using the same predicate as needs_amax: convert existing_amax to a torch.Tensor on target_device first and check it's not all zeros (e.g., tensor.ne(0).any()); skip appending if the tensor is all zeros or None so target_amax won't be set to 0 when all experts are uncalibrated, allowing the weight-stat fallback to run and avoiding misleading warnings for quantizer.amax.
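To make the predicate concrete, here is a minimal standalone sketch of the intended collection behavior; plain Python lists stand in for torch tensors (the real code would use a tensor check such as `tensor.ne(0).any()`), and the names below are illustrative:

```python
def is_valid_amax(existing_amax):
    # Mirror the needs_amax predicate: None and all-zero amax values are invalid.
    # (In the real code this is a torch.Tensor check via tensor.ne(0).any().)
    if existing_amax is None:
        return False
    values = existing_amax if isinstance(existing_amax, list) else [existing_amax]
    return any(v != 0 for v in values)

# With the fix, all-zero experts are skipped, so target_amax is never forced
# to 0 and the weight-stat fallback can run when every expert is uncalibrated.
collected = [None, [0.0, 0.0], [0.0, 1.5], 2.0]
valid_amax_values = [a for a in collected if is_valid_amax(a)]
print(valid_amax_values)  # → [[0.0, 1.5], 2.0]
```

With all-zero inputs, `valid_amax_values` stays empty and the fallback path is reachable, which is exactly what the fix aims for.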
328-344: ⚠️ Potential issue | 🟠 Major

`get_experts_list` does not handle the Qwen3.5 model type and will raise `NotImplementedError` at runtime.

`is_moe` and `get_expert_linear_names` correctly recognize `Qwen3_5MoeSparseMoeBlock`, but `get_experts_list` (lines 91–99) dispatches on model type strings extracted as `type(model).__name__.lower()`. A Qwen3.5 model class like `Qwen3_5MoeForCausalLM` produces `"qwen3_5moeforcausallm"`, which is not matched in the function's checks. If quantization uses AWQ or NVFP4_SVDQUANT, the code at line 297 in `unified_export_hf.py` executes `get_experts_list(module, model_type)` for any MoE module detected by `is_moe`, triggering `NotImplementedError` at line 102.

Add `"qwen3_5moeforcausallm"` to the model type checks in `get_experts_list`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/layer_utils.py` around lines 328 - 344, get_experts_list currently dispatches on type(model).__name__.lower() and lacks handling for Qwen3.5 class names, so add "qwen3_5moeforcausallm" to the model-type checks inside get_experts_list to match the same Qwen3_5 detection used by is_moe and get_expert_linear_names; update the conditional branches that compare model_type (from type(model).__name__.lower()) to include "qwen3_5moeforcausallm" so get_experts_list returns the correct expert list instead of raising NotImplementedError (also verify any other qwen3_5 variants present in that same dispatch and add them if missing).
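A quick way to see why the string must be added: lowering the Qwen3.5 class name yields exactly `"qwen3_5moeforcausallm"`. The sketch below uses a hypothetical stand-in class and a simplified dispatch (the real branch structure in `get_experts_list` may differ):

```python
class Qwen3_5MoeForCausalLM:
    """Stand-in for the real transformers class; only the name matters here."""

# Hypothetical supported-type list; the real checks live inside get_experts_list.
SUPPORTED_MOE_TYPES = ("qwen3moeforcausallm", "qwen3_5moeforcausallm")

def get_experts_list_sketch(module, model_type):
    # Simplified dispatch mirroring type(model).__name__.lower() matching.
    if any(name in model_type for name in SUPPORTED_MOE_TYPES):
        return list(getattr(module, "experts", []))
    raise NotImplementedError(f"Unsupported model type: {model_type}")

model_type = type(Qwen3_5MoeForCausalLM()).__name__.lower()
print(model_type)  # → qwen3_5moeforcausallm
```

Note that `"qwen3moeforcausallm"` is not a substring of `"qwen3_5moeforcausallm"` because of the `_5`, which is why the Qwen3 entry alone does not cover Qwen3.5.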
🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)
649-653: Avoid duplicate custom-file copy in the TensorRT-LLM path.

Now that the canonical copy happens at line 653 (after tokenizer export), the earlier TensorRT-LLM copy at line 616 becomes redundant and adds extra I/O.
♻️ Proposed cleanup
```diff
-    # Copy custom model files (Python files and JSON configs) for TensorRT-LLM export
-    copy_custom_model_files(args.pyt_ckpt_path, export_path, args.trust_remote_code)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 649 - 653, Remove the duplicate custom-file copy in the TensorRT-LLM export path: there is an earlier call to copy_custom_model_files that runs during the TensorRT-LLM branch which becomes redundant because the canonical copy occurs later via copy_custom_model_files(args.pyt_ckpt_path, export_path, args.trust_remote_code) after tokenizer.save_pretrained(); delete the earlier call (and any now-unused surrounding conditional) so custom Python/JSON files are only copied once.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/utils/dataset_utils.py
```python
# Temporarily disable revert_weight_conversion if available — it doesn't handle
# quantized state dicts (scalar scale tensors have 0 dimensions, causing IndexError).
# We must patch both the source module and the importing module since
# modeling_utils does `from core_model_loading import revert_weight_conversion`.
_patches = _patch_revert_weight_conversion()

try:
    model.save_pretrained(
        export_dir,
        state_dict={**post_state_dict, **(extra_state_dict or {})},
        save_modelopt_state=save_modelopt_state,
    )
finally:
    _unpatch_revert_weight_conversion(_patches)
```
Serialize global patching to avoid cross-export races.
The patch/unpatch sequence mutates module globals process-wide. Concurrent exports can interleave and restore the wrong function, causing flaky behavior.
🔒 Proposed fix (serialize patch window)

```diff
+import threading
 ...
+_REVERT_WEIGHT_CONVERSION_PATCH_LOCK = threading.Lock()
 ...
-    _patches = _patch_revert_weight_conversion()
-
-    try:
-        model.save_pretrained(
-            export_dir,
-            state_dict={**post_state_dict, **(extra_state_dict or {})},
-            save_modelopt_state=save_modelopt_state,
-        )
-    finally:
-        _unpatch_revert_weight_conversion(_patches)
+    with _REVERT_WEIGHT_CONVERSION_PATCH_LOCK:
+        _patches = _patch_revert_weight_conversion()
+        try:
+            model.save_pretrained(
+                export_dir,
+                state_dict={**post_state_dict, **(extra_state_dict or {})},
+                save_modelopt_state=save_modelopt_state,
+            )
+        finally:
+            _unpatch_revert_weight_conversion(_patches)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Temporarily disable revert_weight_conversion if available — it doesn't handle
# quantized state dicts (scalar scale tensors have 0 dimensions, causing IndexError).
# We must patch both the source module and the importing module since
# modeling_utils does `from core_model_loading import revert_weight_conversion`.
with _REVERT_WEIGHT_CONVERSION_PATCH_LOCK:
    _patches = _patch_revert_weight_conversion()
    try:
        model.save_pretrained(
            export_dir,
            state_dict={**post_state_dict, **(extra_state_dict or {})},
            save_modelopt_state=save_modelopt_state,
        )
    finally:
        _unpatch_revert_weight_conversion(_patches)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_hf.py` around lines 1060 - 1074, The
patch/unpatch sequence around
_patch_revert_weight_conversion/_unpatch_revert_weight_conversion mutates
process globals and must be serialized to avoid cross-export races; add a
module-level lock (e.g., threading.RLock) and acquire it before calling
_patch_revert_weight_conversion, keep it held across model.save_pretrained(...)
and the finally block, then release after _unpatch_revert_weight_conversion so
only one export at a time can patch globals; update any helper initialization to
use the new lock and ensure exceptions still trigger unpatch+release.
```python
if any(getattr(m, "_if_calib", False) for m in self.experts.modules()):
    # Force all tokens to all experts during calibration
    original_top_k = self.gate.top_k
    self.gate.top_k = self.num_experts
    super(_QuantSparseMoe, self).forward(hidden_states)
    self.gate.top_k = original_top_k
return super(_QuantSparseMoe, self).forward(hidden_states)
```
Always restore `gate.top_k` with `finally`.
If the calibration warmup forward throws, `self.gate.top_k` remains at `self.num_experts`, which can corrupt later routing behavior.
🛠️ Proposed fix

```diff
 def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
     if any(getattr(m, "_if_calib", False) for m in self.experts.modules()):
         # Force all tokens to all experts during calibration
         original_top_k = self.gate.top_k
         self.gate.top_k = self.num_experts
-        super(_QuantSparseMoe, self).forward(hidden_states)
-        self.gate.top_k = original_top_k
+        try:
+            super(_QuantSparseMoe, self).forward(hidden_states)
+        finally:
+            self.gate.top_k = original_top_k
     return super(_QuantSparseMoe, self).forward(hidden_states)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if any(getattr(m, "_if_calib", False) for m in self.experts.modules()):
    # Force all tokens to all experts during calibration
    original_top_k = self.gate.top_k
    self.gate.top_k = self.num_experts
    try:
        super(_QuantSparseMoe, self).forward(hidden_states)
    finally:
        self.gate.top_k = original_top_k
return super(_QuantSparseMoe, self).forward(hidden_states)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 913 - 919,
The calibration block in _QuantSparseMoe.forward temporarily sets
self.gate.top_k to self.num_experts but restores it only after calling
super(...).forward, so an exception during calibration leaves top_k mutated;
wrap the calibration call in a try/finally: save original_top_k, set
self.gate.top_k = self.num_experts, call super(_QuantSparseMoe,
self).forward(hidden_states) inside try, and in finally always restore
self.gate.top_k = original_top_k to guarantee restoration even on errors.
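The restore-on-error behavior is easy to check in isolation. `Gate` and `run_calibration` here are hypothetical minimal stand-ins for the MoE gate and the calibration forward, not the real classes:

```python
class Gate:
    def __init__(self, top_k):
        self.top_k = top_k

def run_calibration(gate, num_experts, forward):
    # try/finally guarantees top_k is restored even if forward() raises.
    original_top_k = gate.top_k
    gate.top_k = num_experts
    try:
        forward()
    finally:
        gate.top_k = original_top_k

gate = Gate(top_k=2)
run_calibration(gate, num_experts=8, forward=lambda: None)
print(gate.top_k)  # → 2
```

If `forward()` raises, the exception still propagates, but `top_k` is back at its original value, so later routing is unaffected.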
4219853 to b8f3f25 (Compare)
Pull request overview
This PR adds ModelOpt PTQ (Post-Training Quantization) support for the Qwen3.5-MoE model (Qwen/Qwen3.5-397B-A17B). The changes include new quantization module support, improved error diagnostics, bug fixes for tokenizer handling, and improvements to the model export workflow.
Changes:
- Added `_QuantQwen35MoeExperts` quantization module for the Qwen3.5 MoE architecture
- Improved MoE layer detection with pattern-based matching (auto-detect modules ending with "sparsemoeblock")
- Enhanced error handling in model export with detailed module information
- Fixed tokenizer encoding to use the modern `tokenizer()` method instead of the deprecated `batch_encode_plus`
- Added patching for `revert_weight_conversion` to handle quantized state dicts with scalar tensors
- Improved zero amax handling in expert quantizer calibration
- Fixed file copy order to preserve original tokenizer files over regenerated ones
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| modelopt/torch/utils/dataset_utils.py | Updated tokenizer call from batch_encode_plus to modern tokenizer() method and improved comment |
| modelopt/torch/quantization/plugins/huggingface.py | Added _QuantQwen35MoeExperts and _Qwen35MoeExpertModule classes for Qwen3.5 MoE support with registration |
| modelopt/torch/export/unified_export_hf.py | Enhanced error messages with module details and added revert_weight_conversion patching to handle scalar tensors |
| modelopt/torch/export/layer_utils.py | Simplified MoE detection with pattern matching, added Qwen3_5MoeSparseMoeBlock to expert linear names, improved amax handling for zero values, and simplified conditional expression |
| examples/llm_ptq/hf_ptq.py | Moved copy_custom_model_files call after tokenizer.save_pretrained to preserve original tokenizer files |
```python
with torch.no_grad():
    module.weight.data = weight.detach().data.to(dtype=dtype, device=device)

expert_dim = self.intermediate_dim
```
The `_QuantQwen35MoeExperts` class directly accesses `self.intermediate_dim` without fallback handling, unlike `_QuantQwen3VLMoeTextExperts`, which checks for both `intermediate_size` and `intermediate_dim` attributes. If the Qwen3.5 model uses `intermediate_size` instead of `intermediate_dim`, this will cause an `AttributeError`. Consider adding the same fallback logic used in `_QuantQwen3VLMoeTextExperts` (lines 671-676).
```python
# Support both `intermediate_size` and `intermediate_dim` depending on the model config.
if hasattr(self, "intermediate_size"):
    expert_dim = self.intermediate_size
else:
    expert_dim = self.intermediate_dim
```
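The fallback is a plain attribute check, which can be shown with hypothetical stand-in classes for the two config spellings (`resolve_expert_dim` and the class names below are illustrative, not the real module):

```python
def resolve_expert_dim(module):
    # Prefer intermediate_size, fall back to intermediate_dim.
    if hasattr(module, "intermediate_size"):
        return module.intermediate_size
    return module.intermediate_dim

class ExpertsWithSize:
    intermediate_size = 1536

class ExpertsWithDim:
    intermediate_dim = 1536

print(resolve_expert_dim(ExpertsWithSize()))  # → 1536
print(resolve_expert_dim(ExpertsWithDim()))   # → 1536
```

Either spelling resolves to the same dimension, so the quantization module works regardless of which attribute name the model config exposes.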
9aaca69 to ffb50e4 (Compare)
ffb50e4 to aa60527 (Compare)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
6af1cd4 to c7bb291 (Compare)
What does this PR do?
Type of change: New model support
Overview: Add ModelOpt PTQ support for https://huggingface.co/Qwen/Qwen3.5-397B-A17B
Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes